Imagine a bank that wants to segment its customers into clusters based on repayment potential: something like a credit score, but not manually calculated.
We are using data from Lending Club, a US peer-to-peer lending company. Borrowers on the platform can take out unsecured personal loans of between $1,000 and $40,000, for a term of either three or five years.
Investors can browse the loan applications and choose to finance the loans based on the credit history of the borrower, the amount of the loan, the loan grade, and the purpose of the loan. Investors earn money through interest paid on the loans, and Lending Club makes money from loan origination fees and service charges.
The data covers 2019 and is publicly available on the Lending Club website (an account is required). A data dictionary is also available.
import glob, os, re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import fastcluster
import hdbscan
from scipy.cluster.hierarchy import fcluster
from sklearn import preprocessing as pp
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

files = glob.glob(os.path.join('data', '*.zip'))
data = pd.concat((pd.read_csv(file, compression='zip', header=1, low_memory=False) for file in files), ignore_index=True)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 518115 entries, 0 to 518114 Columns: 150 entries, id to settlement_term dtypes: float64(112), object(38) memory usage: 592.9+ MB
data.shape
(518115, 150)
data.columns
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
'term', 'int_rate', 'installment', 'grade', 'sub_grade',
...
'orig_projected_additional_accrued_interest',
'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
'debt_settlement_flag', 'debt_settlement_flag_date',
'settlement_status', 'settlement_date', 'settlement_amount',
'settlement_percentage', 'settlement_term'],
dtype='object', length=150)
We don't need all 150 columns, so let's narrow them down.
print('old shape: ', data.shape)
# Select columns to keep
columnsToKeep = ['loan_amnt','funded_amnt','funded_amnt_inv','term', \
'int_rate','installment','grade','sub_grade', \
'emp_length','home_ownership','annual_inc', \
'verification_status','pymnt_plan','purpose', \
'addr_state','dti','delinq_2yrs','earliest_cr_line', \
'mths_since_last_delinq','mths_since_last_record', \
'open_acc','pub_rec','revol_bal','revol_util', \
'total_acc','initial_list_status','out_prncp', \
'out_prncp_inv','total_pymnt','total_pymnt_inv', \
'total_rec_prncp','total_rec_int','total_rec_late_fee', \
'recoveries','collection_recovery_fee','last_pymnt_d', \
'last_pymnt_amnt']
data = data.loc[:,columnsToKeep]
print('new shape: ', data.shape)
old shape: (518115, 150) new shape: (518115, 37)
data.head()
| loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_length | home_ownership | ... | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5000.0 | 5000.0 | 5000.0 | 36 months | 13.90% | 170.65 | C | C1 | NaN | OWN | ... | 3164.76 | 2552.03 | 2552.03 | 1835.24 | 716.79 | 0.0 | 0.0 | 0.0 | Jun-2020 | 170.65 |
| 1 | 6500.0 | 6500.0 | 6475.0 | 36 months | 8.81% | 206.13 | A | A5 | 10+ years | OWN | ... | 0.00 | 7015.28 | 6988.30 | 6500.00 | 515.28 | 0.0 | 0.0 | 0.0 | Apr-2020 | 4754.21 |
| 2 | 24000.0 | 24000.0 | 24000.0 | 60 months | 13.90% | 557.20 | C | C1 | 10+ years | MORTGAGE | ... | 20427.69 | 7206.53 | 7206.53 | 3572.31 | 3634.22 | 0.0 | 0.0 | 0.0 | Jun-2020 | 557.20 |
| 3 | 18500.0 | 18500.0 | 18500.0 | 60 months | 14.74% | 437.60 | C | C2 | NaN | MORTGAGE | ... | 15121.23 | 6533.70 | 6533.70 | 3378.77 | 3154.93 | 0.0 | 0.0 | 0.0 | Jul-2020 | 437.60 |
| 4 | 24000.0 | 24000.0 | 24000.0 | 60 months | 20.00% | 635.86 | D | D2 | 8 years | MORTGAGE | ... | 20017.99 | 9484.57 | 9484.57 | 3982.01 | 5502.56 | 0.0 | 0.0 | 0.0 | Jun-2020 | 635.86 |
5 rows × 37 columns
# Transform features from string to numeric
for i in ["term", "int_rate", "emp_length", "revol_util"]:
    # Keep digits and the decimal point; stripping the point would silently
    # turn a rate like "13.90%" into 1390
    data.loc[:, i] = data.loc[:, i].apply(lambda x: re.sub("[^0-9.]", "", str(x)))
    data.loc[:, i] = pd.to_numeric(data.loc[:, i], errors='coerce')
# Determine which features are numerical
numericalFeats = [x for x in data.columns if data[x].dtype != 'object']
nanCounter = np.isnan(data.loc[:,numericalFeats]).sum()
nanCounter
loan_amnt 8 funded_amnt 8 funded_amnt_inv 8 term 8 int_rate 8 installment 8 emp_length 46259 annual_inc 8 dti 1153 delinq_2yrs 8 mths_since_last_delinq 292561 mths_since_last_record 462994 open_acc 8 pub_rec 8 revol_bal 8 revol_util 682 total_acc 8 out_prncp 8 out_prncp_inv 8 total_pymnt 8 total_pymnt_inv 8 total_rec_prncp 8 total_rec_int 8 total_rec_late_fee 8 recoveries 8 collection_recovery_fee 8 last_pymnt_amnt 8 dtype: int64
# Impute NaNs with mean
fillWithMean = ['loan_amnt','funded_amnt','funded_amnt_inv','term', \
'int_rate','installment','emp_length','annual_inc',\
'dti','open_acc','revol_bal','revol_util','total_acc',\
'out_prncp','out_prncp_inv','total_pymnt', \
'total_pymnt_inv','total_rec_prncp','total_rec_int', \
'last_pymnt_amnt']
# Impute NaNs with zero
fillWithZero = ['delinq_2yrs','mths_since_last_delinq', \
'mths_since_last_record','pub_rec','total_rec_late_fee', \
'recoveries','collection_recovery_fee']
# Perform imputation
im = SimpleImputer(strategy='mean')
data.loc[:,fillWithMean] = im.fit_transform(data[fillWithMean])
data.loc[:,fillWithZero] = data.loc[:,fillWithZero].fillna(0)
# Check for NaNs one last time
nanCounter = np.isnan(data.loc[:,numericalFeats]).sum()
nanCounter
loan_amnt 0 funded_amnt 0 funded_amnt_inv 0 term 0 int_rate 0 installment 0 emp_length 0 annual_inc 0 dti 0 delinq_2yrs 0 mths_since_last_delinq 0 mths_since_last_record 0 open_acc 0 pub_rec 0 revol_bal 0 revol_util 0 total_acc 0 out_prncp 0 out_prncp_inv 0 total_pymnt 0 total_pymnt_inv 0 total_rec_prncp 0 total_rec_int 0 total_rec_late_fee 0 recoveries 0 collection_recovery_fee 0 last_pymnt_amnt 0 dtype: int64
# Feature engineering
# Typical ratios between loan amount, revolving balance, payments,
# and the borrower's annual income
data['installmentOverLoanAmnt'] = data.installment/data.loan_amnt
data['loanAmntOverIncome'] = data.loan_amnt/data.annual_inc
data['revol_balOverIncome'] = data.revol_bal/data.annual_inc
data['totalPymntOverIncome'] = data.total_pymnt/data.annual_inc
data['totalPymntInvOverIncome'] = data.total_pymnt_inv/data.annual_inc
data['totalRecPrncpOverIncome'] = data.total_rec_prncp/data.annual_inc
data['totalRecIncOverIncome'] = data.total_rec_int/data.annual_inc
newFeats = ['installmentOverLoanAmnt','loanAmntOverIncome', \
'revol_balOverIncome','totalPymntOverIncome', \
'totalPymntInvOverIncome','totalRecPrncpOverIncome', \
'totalRecIncOverIncome']
Some issues surfaced later on: several of the newly created ratio columns contained positive infinity, produced wherever annual_inc is zero.
print('NaN in numericalFeats: ', np.any(np.isnan(data.loc[:,numericalFeats])))
print('NaN in newFeats: ', np.any(np.isnan(data.loc[:,newFeats])))
data.loc[:,newFeats].max()
NaN in numericalFeats: False NaN in newFeats: True
installmentOverLoanAmnt 0.04295 loanAmntOverIncome inf revol_balOverIncome inf totalPymntOverIncome inf totalPymntInvOverIncome inf totalRecPrncpOverIncome inf totalRecIncOverIncome inf dtype: float64
print('old data shape: ', data.shape)
with pd.option_context('mode.use_inf_as_na', True):
# Dropping the rows with nan
# (or inf) values
data.dropna(inplace=True)
print('NaN in numericalFeats: ', np.any(np.isnan(data.loc[:,numericalFeats])))
print('NaN in newFeats: ', np.any(np.isnan(data.loc[:,newFeats])))
print('new data shape: ', data.shape)
old data shape: (518115, 44) NaN in numericalFeats: False NaN in newFeats: False new data shape: (516108, 44)
# Select features for training
numericalPlusNewFeats = numericalFeats+newFeats
X_train = data.loc[:,numericalPlusNewFeats]
# Scale data
sX = pp.StandardScaler()
X_train.loc[:,:] = sX.fit_transform(X_train)
X_train.columns
Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
'installment', 'emp_length', 'annual_inc', 'dti', 'delinq_2yrs',
'mths_since_last_delinq', 'mths_since_last_record', 'open_acc',
'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'out_prncp',
'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
'total_rec_int', 'total_rec_late_fee', 'recoveries',
'collection_recovery_fee', 'last_pymnt_amnt', 'installmentOverLoanAmnt',
'loanAmntOverIncome', 'revol_balOverIncome', 'totalPymntOverIncome',
'totalPymntInvOverIncome', 'totalRecPrncpOverIncome',
'totalRecIncOverIncome'],
dtype='object')
data.grade.value_counts(dropna=False)
A 165175 B 150906 C 121873 D 74726 E 3375 F 36 G 17 Name: grade, dtype: int64
# Designate labels for evaluation
labels = data.grade
labels.unique()
array(['C', 'A', 'D', 'B', 'E', 'G', 'F'], dtype=object)
# Fill missing labels if there are any
labels = labels.fillna(value="Z")
# Convert labels to numerical values
lbl = pp.LabelEncoder()
lbl.fit(list(labels.values))
# Keep the original index so the encoded labels align with data later
labels = pd.Series(data=lbl.transform(labels.values), name='grade', index=labels.index)
# Store as y_train
y_train = labels
# View the changes
labelsOriginalVSNew = pd.concat([labels, data.grade],axis=1)
labelsOriginalVSNew.head(5)
| grade | grade | |
|---|---|---|
| 0 | 2.0 | C |
| 1 | 0.0 | A |
| 2 | 2.0 | C |
| 3 | 2.0 | C |
| 4 | 3.0 | D |
# Compare loan grades with interest rates
interestAndGrade = pd.DataFrame(data=[data.int_rate,labels])
interestAndGrade = interestAndGrade.T
interestAndGrade.groupby("grade").mean()
| int_rate | |
|---|---|
| grade | |
| 0.0 | 1285.583177 |
| 1.0 | 1286.537482 |
| 2.0 | 1288.398079 |
| 3.0 | 1290.670952 |
| 4.0 | 1243.752083 |
| 5.0 | 1351.028571 |
| 6.0 | 1209.294118 |
We now have X_train with all 34 numerical features and y_train with the numerical loan grades (used only to validate the results). Next, let's check the features for correlations.
corr = X_train.corr()
# The mask must be boolean; hide the redundant upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(25,16))
sns.heatmap(corr, annot=True, fmt='.2g', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=mask)
plt.title('Features Correlation Matrix Heatmap', size=16)
Text(0.5, 1.0, 'Features Correlation Matrix Heatmap')
We can see that many features are strongly linearly correlated, such as funded_amnt and funded_amnt_inv with loan_amnt, out_prncp_inv with out_prncp, and collection_recovery_fee with recoveries. I'll try removing the redundant ones and see if we can get a streamlined set of features.
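Before the manual drops, we can cross-check which pairs actually sit above a correlation cutoff. This helper is a sketch that is not part of the original analysis (`correlated_pairs` is a hypothetical name), demonstrated on a small synthetic frame; in the notebook you would pass `X_train.corr()` instead.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr, cutoff=0.9):
    """Return (feature_a, feature_b, corr) for upper-triangle pairs above the cutoff."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            c = float(corr.iloc[i, j])
            if abs(c) >= cutoff:
                pairs.append((cols[i], cols[j], round(c, 3)))
    return pairs

# Synthetic demo: 'funded_amnt' is 'loan_amnt' plus tiny noise, 'dti' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({'loan_amnt': a,
                   'funded_amnt': a + rng.normal(scale=0.01, size=500),
                   'dti': rng.normal(size=500)})
print(correlated_pairs(df.corr(), cutoff=0.9))
```

Running this on the real correlation matrix would list exactly the kind of pairs dropped by hand below.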
X_train = X_train.drop(['funded_amnt', 'funded_amnt_inv', 'installment'], axis=1) # loan_amnt
X_train = X_train.drop(['mths_since_last_record'], axis=1) # pub_rec
X_train = X_train.drop(['out_prncp_inv'], axis=1) # out_prncp
X_train = X_train.drop(['total_pymnt', 'total_pymnt_inv'], axis=1) # total_rec_prncp
X_train = X_train.drop(['collection_recovery_fee'], axis=1) # recoveries
X_train = X_train.drop(['revol_balOverIncome', 'totalPymntOverIncome', 'totalPymntInvOverIncome', 'totalRecPrncpOverIncome', 'totalRecIncOverIncome'], axis=1) # loanAmntOverIncome
X_train = X_train.drop(['installmentOverLoanAmnt'], axis=1) # term
corr = X_train.corr()
# The mask must be boolean; hide the redundant upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(25,16))
sns.heatmap(corr, annot=True, fmt='.2g', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=mask)
plt.title('Features Correlation Matrix Heatmap', size=16)
Text(0.5, 1.0, 'Features Correlation Matrix Heatmap')
pca_ = PCA().fit(X_train)
X_pca = pca_.transform(X_train)  # reuse the fitted model rather than fitting twice
pca_evr = pca_.explained_variance_ratio_
cumsum_ = np.cumsum(pca_evr)
# Get dimensions where var >= 95%
dim_95 = np.argmax(cumsum_ >= 0.95) + 1
instances_, dims_ = X_train.shape
# check shape of X
if dims_ > instances_:
print("WARNING: number of features greater than number of instances.")
dimensions = list(range(1, instances_+1))
else:
dimensions = list(range(1, dims_+1))
# Print report
print(f"You can reduce from {dims_} to {dim_95} dimensions while retaining 95% of variance.")
You can reduce from 20 to 16 dimensions while retaining 95% of variance.
pca_16 = PCA(n_components=16).fit(X_train)
X_pca_16 = pca_16.transform(X_train)  # reuse the fitted model rather than fitting twice
PCAdf = pd.DataFrame(pca_16.components_, columns=list(X_train.columns))
# Transpose so that rows are features and columns are PCA dimensions
feat_cor = PCAdf.T
# Draw a heatmap with the numeric values in each cell
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(feat_cor, annot=True, linewidths=.5, ax=ax, cmap='coolwarm')
plt.title('PCA Dimension Components Matrix Heatmap', size=16)
Text(0.5, 1.0, 'PCA Dimension Components Matrix Heatmap')
# Print the five largest component loadings for each PCA dimension
for col in feat_cor.columns:
    print(feat_cor[col].sort_values(ascending=False).head(5))
loan_amnt 0.491383 total_rec_int 0.447772 out_prncp 0.412221 term 0.296014 revol_bal 0.284829 Name: 0, dtype: float64 last_pymnt_amnt 0.560076 total_rec_prncp 0.542091 total_acc 0.246680 open_acc 0.213005 annual_inc 0.114309 Name: 1, dtype: float64 open_acc 0.551213 total_acc 0.538018 dti 0.113655 revol_bal 0.108285 mths_since_last_delinq 0.103607 Name: 2, dtype: float64 annual_inc 0.276965 out_prncp 0.267413 loan_amnt 0.194992 revol_bal 0.140314 loanAmntOverIncome -0.002490 Name: 3, dtype: float64 revol_util 0.624533 revol_bal 0.477545 annual_inc 0.268811 delinq_2yrs 0.141519 dti 0.109721 Name: 4, dtype: float64 delinq_2yrs 0.488972 annual_inc 0.374840 total_rec_late_fee 0.334446 recoveries 0.219888 mths_since_last_delinq 0.197932 Name: 5, dtype: float64 delinq_2yrs 0.369213 pub_rec 0.244885 recoveries 0.203391 total_rec_late_fee 0.178701 dti 0.120163 Name: 6, dtype: float64 recoveries 0.667965 total_rec_late_fee 0.368561 pub_rec 0.259649 mths_since_last_delinq 0.107204 revol_bal 0.105065 Name: 7, dtype: float64 pub_rec 0.580689 emp_length 0.574274 annual_inc 0.249403 revol_util 0.212957 loanAmntOverIncome 0.157230 Name: 8, dtype: float64 loanAmntOverIncome 0.983359 mths_since_last_delinq 0.051820 total_rec_late_fee 0.051457 delinq_2yrs 0.040067 dti 0.023447 Name: 9, dtype: float64 recoveries 0.582447 term 0.142224 delinq_2yrs 0.085341 loanAmntOverIncome 0.047750 last_pymnt_amnt 0.032539 Name: 10, dtype: float64 emp_length 0.762597 total_rec_late_fee 0.222641 dti 0.188678 recoveries 0.181077 loan_amnt 0.058533 Name: 11, dtype: float64 delinq_2yrs 0.447137 mths_since_last_delinq 0.414639 pub_rec 0.403159 dti 0.304807 loan_amnt 0.208964 Name: 12, dtype: float64 annual_inc 0.688497 dti 0.568426 term 0.097892 int_rate 0.083257 emp_length 0.036538 Name: 13, dtype: float64 term 0.732542 revol_bal 0.304663 delinq_2yrs 0.196178 mths_since_last_delinq 0.135775 last_pymnt_amnt 0.122975 Name: 14, dtype: float64 revol_bal 0.692853 int_rate 0.254645 mths_since_last_delinq 
0.129626 total_rec_int 0.088147 delinq_2yrs 0.061726 Name: 15, dtype: float64
By running PCA, the cumulative explained variance shows that we can retain 95% of the variance with only 16 dimensions rather than all 20. So by running PCA again with only 16 dimensions and checking the component loadings of each dimension, we can see which features are connected.
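As a side note, scikit-learn can choose the dimensionality for a target variance directly: passing a float between 0 and 1 as n_components keeps just enough components to explain that fraction of variance. A minimal sketch on synthetic data (in the notebook you would fit on X_train instead):

```python
import numpy as np
from sklearn.decomposition import PCA

# 500 samples in 10 dims; most variance sits in the first three directions
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10)) * np.array([10.0, 8.0, 6.0, 1, 1, 1, 1, 1, 1, 1])

pca = PCA(n_components=0.95)  # keep enough components for >= 95% of variance
X_reduced = pca.fit_transform(X_demo)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```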
# Random sample of 10,000 out of ~516,000
np.random.seed(123)
indices = np.random.choice(data.shape[0], 10000)
# Note: /255.0 uniformly rescales the already-standardized features; K-Means is
# unaffected by a uniform rescale, but the hierarchical distance threshold used
# later assumes this scale
X = X_train.iloc[indices] / 255.0
y = y_train.iloc[indices]
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
print(X.shape, y.shape)
print(X.shape, y.shape)
(10000, 20) (10000,)
The clusters we want should be filled with borrowers of similar numerical loan grade, which we will validate using the numerical loan grades we set aside in y. The higher the percentage of borrowers sharing the most frequently occurring numerical loan grade within each and every cluster, the better the clustering application.

For example, consider a cluster with 100 borrowers. If 30 borrowers have a numerical grade of 0, 25 have a grade of 1, 20 have a grade of 2, and the remaining borrowers have grades ranging from 3 to 6, we would say the cluster has 30% accuracy, given that the most frequently occurring loan grade applies to just 30% of the borrowers in that cluster.
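That worked example can be checked directly (the grade counts past the first three are hypothetical, chosen to fill out the 100 borrowers):

```python
import pandas as pd

# Hypothetical grade labels for one 100-borrower cluster
cluster_grades = pd.Series([0]*30 + [1]*25 + [2]*20 + [3]*10 + [4]*8 + [5]*5 + [6]*2)
most_frequent_count = cluster_grades.value_counts().iloc[0]  # 30 borrowers share grade 0
accuracy = most_frequent_count / len(cluster_grades)
print(accuracy)  # 0.3
```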
To analyze future clusters:
def analyzeCluster(clusterDF, labelsDF):
countByCluster = pd.DataFrame(data=clusterDF['cluster'].value_counts())
countByCluster.reset_index(inplace=True, drop=False)
countByCluster.columns = ['cluster', 'clusterCount']
preds = pd.concat([labelsDF, clusterDF], axis=1)
preds.columns = ['trueLabel', 'cluster']
countByLabel = pd.DataFrame(data=preds.groupby('trueLabel').count())
countMostFreq = pd.DataFrame(data=preds.groupby('cluster').agg(lambda x : x.value_counts().iloc[0]))
countMostFreq.reset_index(inplace=True, drop=False)
countMostFreq.columns = ['cluster', 'countMostFrequent']
accuracyDF = countMostFreq.merge(countByCluster, left_on='cluster', right_on='cluster')
overallAccuracy = accuracyDF.countMostFrequent.sum()/accuracyDF.clusterCount.sum()
accuracyByLabel = accuracyDF.countMostFrequent/accuracyDF.clusterCount
return countByCluster, countByLabel, countMostFreq, accuracyDF, overallAccuracy, accuracyByLabel
n_clusters = 10
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 123
cluster_range = range(2,36)
kmeans = KMeans(n_clusters=n_clusters,
n_init=n_init,
max_iter=max_iter,
tol=tol,
random_state=random_state)
# Inertia
kMeans_inertia = pd.DataFrame(data=[],
index=cluster_range,
columns=['inertia'])
# Overall Accuracy
overallAccuracy_kMeansDF = pd.DataFrame(data=[],
index=cluster_range,
columns=['overallAccuracy'])
# ARI
ari_KMeansDF = pd.DataFrame(data=[],
index=cluster_range,
columns=['ari'])
# Silhouette Score
ss_KMeansDF = pd.DataFrame(data=[],
index=cluster_range,
columns=['ss'])
for n_clusters in cluster_range:
kmeans = KMeans(n_clusters=n_clusters,
n_init=n_init,
max_iter=max_iter,
tol=tol,
random_state=random_state)
kmeans.fit(X)
kMeans_inertia.loc[n_clusters] = kmeans.inertia_
cluster = kmeans.predict(X)
cluster_df = pd.DataFrame(data=cluster,
index=X.index,
columns=['cluster'])
countByCluster_kMeans, countByLabel_kMeans, \
countMostFreq_kMeans, accuracyDF_kMeans, \
overallAccuracy_kMeans, accuracyByLabel_kMeans = \
analyzeCluster(cluster_df, y)
overallAccuracy_kMeansDF.loc[n_clusters] = overallAccuracy_kMeans
ari_KMeansDF.loc[n_clusters] = adjusted_rand_score(y, cluster)
ss_KMeansDF.loc[n_clusters] = silhouette_score(X, cluster, metric='euclidean')
overallAccuracy_kMeansDF.plot()
<AxesSubplot:>
kMeans_inertia.plot()
<AxesSubplot:>
ari_KMeansDF.plot()
<AxesSubplot:>
ss_KMeansDF.plot()
<AxesSubplot:>
accuracyByLabel_kMeans
0 0.417886 1 0.356923 2 0.360938 3 0.750000 4 0.503448 5 0.404110 6 0.624506 7 0.718053 8 0.307692 9 0.333333 10 0.529412 11 0.401786 12 0.410072 13 1.000000 14 0.387097 15 0.527421 16 0.480000 17 0.500000 18 0.550000 19 0.342193 20 0.588477 21 0.478873 22 0.617357 23 0.505882 24 0.294821 25 0.273684 26 0.344262 27 0.664286 28 0.394737 29 0.368280 30 0.342342 31 0.380952 32 0.400000 33 0.399390 34 0.457082 dtype: float64
As we can see from overallAccuracy, accuracy is best around 30 clusters and levels out there at approximately 49%.

Inertia drops gradually and consistently, so there is never really an elbow from which to pick a clear number of clusters.

ARI shows a huge spike around 13-14 clusters. The Silhouette Score shows a drop around 7 clusters and then flattens out, although the difference is not as dramatic.
Weighting ARI most heavily while acknowledging the other evaluations, let us try 13 clusters.
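The "best k by ARI" choice can also be read off a metrics table like ari_KMeansDF programmatically with idxmax. A tiny sketch with made-up values (the notebook's table stores objects, so cast with astype(float) first there):

```python
import pandas as pd

# Made-up ARI values indexed by candidate k, mimicking ari_KMeansDF
ari_demo = pd.DataFrame({'ari': [0.05, 0.06, 0.09, 0.13, 0.12, 0.11]},
                        index=range(10, 16))
best_k = ari_demo['ari'].idxmax()
print(best_k)  # 13
```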
n_clusters = 13
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 123
kmeans = KMeans(n_clusters=n_clusters,
n_init=n_init,
max_iter=max_iter,
tol=tol,
random_state=random_state)
kmeans.fit(X)
kMeans_inertia.loc[n_clusters] = kmeans.inertia_
cluster = kmeans.predict(X)
cluster_df = pd.DataFrame(data=cluster,
index=X.index,
columns=['cluster'])
countByCluster_kMeans, countByLabel_kMeans, \
countMostFreq_kMeans, accuracyDF_kMeans, \
overallAccuracy_kMeans, accuracyByLabel_kMeans = \
analyzeCluster(cluster_df, y)
print('overallAccuracy_kMeans: ', overallAccuracy_kMeans)
print('ARI: ', adjusted_rand_score(y, cluster))
print('Silhouette Score: ', silhouette_score(X, cluster, metric='euclidean'))
overallAccuracy_kMeans: 0.4524 ARI: 0.09279624324041397 Silhouette Score: 0.11465279039276895
accuracyByLabel_kMeans
0 0.314159 1 0.400445 2 0.382230 3 0.386991 4 0.750000 5 0.526316 6 0.447368 7 0.617886 8 1.000000 9 0.371113 10 0.610671 11 0.400000 12 0.291498 dtype: float64
accuracyDF_kMeans
| cluster | countMostFrequent | clusterCount | |
|---|---|---|---|
| 0 | 0 | 284 | 904 |
| 1 | 1 | 360 | 899 |
| 2 | 2 | 641 | 1677 |
| 3 | 3 | 589 | 1522 |
| 4 | 4 | 3 | 4 |
| 5 | 5 | 40 | 76 |
| 6 | 6 | 17 | 38 |
| 7 | 7 | 608 | 984 |
| 8 | 8 | 1 | 1 |
| 9 | 9 | 370 | 997 |
| 10 | 10 | 1465 | 2399 |
| 11 | 11 | 2 | 5 |
| 12 | 12 | 144 | 494 |
We can also see why one cluster came out 100% accurate: it contains a single borrower.

This is a starting point for building a clustering application that automatically assigns new borrowers applying for a Lending Club loan to a pre-existing group, based on how similar they are to the other borrowers.
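The assignment step itself is just predict on the fitted model, applied to features scaled exactly like the training data. A sketch with synthetic stand-ins (in the notebook, sX and kmeans are the already-fitted scaler and model):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)
train = rng.normal(size=(1000, 5))  # stand-in for the training features

scaler = StandardScaler().fit(train)
kmeans = KMeans(n_clusters=13, n_init=10, random_state=123).fit(scaler.transform(train))

new_borrower = rng.normal(size=(1, 5))  # one new loan application
cluster_id = kmeans.predict(scaler.transform(new_borrower))[0]
print(cluster_id)
```

The key point is that the scaler fitted on the training data is reused for the new applicant, never refitted.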
# Attach cluster assignments to a copy of the features
# (copy, so X itself stays unmodified for the later clustering methods)
kmeans_df = X.copy()
kmeans_df['cluster'] = cluster
len(kmeans_df.columns)
21
def plot_cluster(y, ax, x='cluster', data=kmeans_df):
    sns.stripplot(data=data, x=x, y=y, ax=ax)
    sns.boxplot(data=data, x=x, y=y, ax=ax)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(y, size=16)
fig, axes = plt.subplots(20, 1, figsize=(10,105))
for i in range(len(axes)):
plot_cluster(y=kmeans_df.columns[i], ax=axes[i])
- pub_rec (Number of derogatory public records, most likely bankruptcy filings)
- loan_amnt (Listed amount of the loan applied for by the borrower)
- int_rate (Interest rate on the loan)

This cluster is for people who filed for bankruptcy multiple times, and thus generally have low loan amounts and low interest rates.
dim_cols = ['pub_rec', 'loan_amnt', 'int_rate']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- total_rec_int (Interest received to date)
- loan_amnt (Listed amount of the loan applied for by the borrower)
- total_rec_late_fee (Late fees received to date)

This cluster is for people who pay a lot of interest, essentially due to high loan amounts and late payments.
dim_cols = ['total_rec_int', 'loan_amnt', 'total_rec_late_fee']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- revol_util (Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit)
- delinq_2yrs (The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years)
- loan_amnt (Listed amount of the loan applied for by the borrower)
- total_rec_int (Interest received to date)

This cluster consists of borrowers who use up a higher percentage of their credit limit, increasing their chances of delinquency; as a result they are approved only for lower loan amounts, which accumulate less interest despite the increased chance of missed payments.
dim_cols = ['revol_util', 'delinq_2yrs', 'loan_amnt', 'total_rec_int']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- loan_amnt (Listed amount of the loan applied for by the borrower)
- delinq_2yrs (The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years)
- total_rec_late_fee (Late fees received to date)

This cluster contains otherwise average borrowers who occasionally miss payments: they sit near the middle in every category and receive average loan amounts, but the occasional missed payment leaves them with slightly higher late fees.
dim_cols = ['loan_amnt', 'delinq_2yrs', 'total_rec_late_fee']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- annual_inc (The self-reported annual income provided by the borrower during registration)
- open_acc (The number of open credit lines in the borrower's credit file)
- total_acc (The total number of credit lines currently in the borrower's credit file)
- delinq_2yrs (The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years)
- mths_since_last_delinq (The number of months since the borrower's last delinquency)

This cluster of four has remarkably high incomes. Surprisingly, they are also very good with their money: they open the fewest accounts and never miss payments.
dim_cols = ['annual_inc', 'open_acc', 'total_acc', 'delinq_2yrs', 'mths_since_last_delinq']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- revol_bal (Total credit revolving balance)
- int_rate (Interest rate on the loan)

This cluster carries a large total credit revolving balance and, on average, a low interest rate.
dim_cols = ['revol_bal', 'int_rate']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- recoveries (Post charge-off gross recovery)
- int_rate (Interest rate on the loan)

This cluster has large post charge-off recoveries and abnormally high interest rates.
dim_cols = ['recoveries', 'int_rate']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- total_rec_late_fee (Late fees received to date)
- out_prncp (Remaining outstanding principal for total amount funded)
- total_rec_int (Interest received to date)

This cluster is similar to Cluster 1, but without the high interest payments: these borrowers keep their interest minimal but still have a lot of loan left to pay off.
dim_cols = ['total_rec_late_fee', 'out_prncp', 'total_rec_int']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- recoveries (Post charge-off gross recovery)
- emp_length (Employment length in years; possible values are between 0 and 10)

This single-borrower cluster represents one massive recovery on a previously charged-off loan, from someone who is not employed.
dim_cols = ['recoveries', 'emp_length']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- open_acc (The number of open credit lines in the borrower's credit file)
- total_acc (The total number of credit lines currently in the borrower's credit file)

This cluster has an abnormally high number of open accounts.
dim_cols = ['open_acc', 'total_acc']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- loan_amnt (The listed amount of the loan applied for by the borrower)
- int_rate (Interest rate on the loan)
- term (The number of payments on the loan; values are in months and can be either 36 or 60)
- delinq_2yrs (The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years)

This cluster takes very low loan amounts, at low interest rates, with the shorter term.
dim_cols = ['loan_amnt', 'int_rate', 'term', 'delinq_2yrs']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- dti ("Debt-to-Income": a ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income)
- annual_inc (The self-reported annual income provided by the borrower during registration)

This cluster of five has the highest DTI (debt-to-income) ratios, unsurprisingly paired with abysmal annual incomes.
dim_cols = ['dti', 'annual_inc']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
- loan_amnt (The listed amount of the loan applied for by the borrower)
- total_rec_prncp (Principal received to date)
- last_pymnt_amnt (Last total payment amount received)

This cluster gets higher-than-average loans and pays off the principal quickly, with a lump sum as the final payment.
dim_cols = ['loan_amnt', 'total_rec_prncp', 'last_pymnt_amnt']
fig, axes = plt.subplots(1, len(dim_cols), figsize=(len(dim_cols)*7,5))
for i in range(len(axes)):
plot_cluster(y=dim_cols[i], ax=axes[i])
Z = fastcluster.linkage_vector(X, method='ward', metric='euclidean')
Z_df = pd.DataFrame(data=Z, columns=['clusterOne',
                                     'clusterTwo',
                                     'distance',
                                     'newClusterSize'])
distance_threshold = 0.175
clusters = fcluster(Z, distance_threshold, criterion='distance')
X_hierClustered = pd.DataFrame(data=clusters,
                               index=X.index,
                               columns=['cluster'])
print("Number of distinct clusters: ", len(X_hierClustered['cluster'].unique()))
Number of distinct clusters: 22
countByCluster_hierClust, countByLabel_hierClust, countMostFreq_hierClust, \
accuracyDF_hierClust, overallAccuracy_hierClust, accuracyByLabel_hierClust = \
analyzeCluster(X_hierClustered, y)
print('Overall accuracy from hierarchical clustering: ', overallAccuracy_hierClust)
print('ARI: ', adjusted_rand_score(y, clusters))
print('Silhouette Score: ', silhouette_score(X, clusters, metric='euclidean'))
Overall accuracy from hierarchical clustering: 0.4608 ARI: 0.07169597463439129 Silhouette Score: 0.3114058391693978
accuracyByLabel_hierClust
0 0.314159 1 0.400455 2 0.400000 3 0.382586 4 0.388451 5 0.431373 6 0.403800 7 0.370017 8 0.500000 9 0.619342 10 1.000000 11 0.447368 12 0.750000 13 0.526316 14 0.400000 15 0.373418 16 0.273810 17 0.387097 18 0.468563 19 0.409396 20 0.529900 21 0.637730 dtype: float64
Again, the per-cluster accuracy ranges from 27% to 100%, and the 100% cluster is once more a cluster of size 1.
accuracyDF_hierClust
| cluster | countMostFrequent | clusterCount | |
|---|---|---|---|
| 0 | 1 | 284 | 904 |
| 1 | 2 | 352 | 879 |
| 2 | 3 | 8 | 20 |
| 3 | 4 | 145 | 379 |
| 4 | 5 | 444 | 1143 |
| 5 | 6 | 44 | 102 |
| 6 | 7 | 170 | 421 |
| 7 | 8 | 427 | 1154 |
| 8 | 9 | 6 | 12 |
| 9 | 10 | 602 | 972 |
| 10 | 11 | 1 | 1 |
| 11 | 12 | 17 | 38 |
| 12 | 13 | 3 | 4 |
| 13 | 14 | 40 | 76 |
| 14 | 15 | 2 | 5 |
| 15 | 16 | 59 | 158 |
| 16 | 17 | 92 | 336 |
| 17 | 18 | 12 | 31 |
| 18 | 19 | 313 | 668 |
| 19 | 20 | 122 | 298 |
| 20 | 21 | 319 | 602 |
| 21 | 22 | 1146 | 1797 |
hier_df = X.copy()  # copy, so X stays unmodified for the next method
hier_df['cluster'] = clusters
fig, axes = plt.subplots(20, 1, figsize=(10,105))
for i in range(len(axes)):
plot_cluster(y=hier_df.columns[i], ax=axes[i], data=hier_df)
min_cluster_size = 20
min_samples = 20
alpha = 1.0
cluster_selection_method = 'leaf'
hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
min_samples=min_samples,
alpha=alpha,
cluster_selection_method=cluster_selection_method)
X_hdbscanClustered = hdb.fit_predict(X)
X_hdbscanClustered_df = pd.DataFrame(data=X_hdbscanClustered,
index=X.index,
columns=['cluster'])
countByCluster_hdbscan, countByLabel_hdbscan, \
countMostFreq_hdbscan, accuracyDF_hdbscan, \
overallAccuracy_hdbscan, accuracyByLabel_hdbscan = \
analyzeCluster(X_hdbscanClustered_df, y)
print('Overall accuracy from HDBSCAN: ', overallAccuracy_hdbscan)
print('ARI: ', adjusted_rand_score(y, X_hdbscanClustered))
print('Silhouette Score: ', silhouette_score(X, X_hdbscanClustered, metric='euclidean'))
Overall accuracy from HDBSCAN: 0.4567 ARI: 0.061268351488627436 Silhouette Score: 0.6342216209546551
accuracyByLabel_hdbscan
0 0.336927 1 0.447368 2 0.519481 3 0.373418 4 0.273810 5 0.529900 6 0.637730 7 0.409396 8 0.387097 9 0.468563 10 0.431373 11 0.388451 12 0.382586 13 0.403800 14 0.370017 15 0.500000 16 0.382311 17 0.469880 18 0.463415 19 0.742857 20 0.711864 dtype: float64
accuracyDF_hdbscan
| cluster | countMostFrequent | clusterCount | |
|---|---|---|---|
| 0 | -1 | 375 | 1113 |
| 1 | 0 | 17 | 38 |
| 2 | 1 | 40 | 77 |
| 3 | 2 | 59 | 158 |
| 4 | 3 | 92 | 336 |
| 5 | 4 | 319 | 602 |
| 6 | 5 | 1146 | 1797 |
| 7 | 6 | 122 | 298 |
| 8 | 7 | 12 | 31 |
| 9 | 8 | 313 | 668 |
| 10 | 9 | 44 | 102 |
| 11 | 10 | 444 | 1143 |
| 12 | 11 | 145 | 379 |
| 13 | 12 | 170 | 421 |
| 14 | 13 | 427 | 1154 |
| 15 | 14 | 15 | 30 |
| 16 | 15 | 268 | 701 |
| 17 | 16 | 39 | 83 |
| 18 | 17 | 190 | 410 |
| 19 | 18 | 78 | 105 |
| 20 | 19 | 252 | 354 |
From the cluster dataframe, we can see that about 11% of the data (the 1,113 points in cluster -1) was not grouped at all. The ARI and Silhouette Scores are a smidgen higher than the previous methods, and tuning the parameters could likely improve the results further.
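Before tuning, it helps to measure the noise share directly; HDBSCAN marks unassigned points with the label -1. A minimal sketch, using a small hypothetical labels array standing in for `X_hdbscanClustered` above:

```python
import numpy as np

# Hypothetical stand-in for the X_hdbscanClustered labels above.
labels = np.array([-1, -1, 0, 0, 0, 1, 1, -1, 2, 2])

noise_share = np.mean(labels == -1)  # HDBSCAN labels unassigned points -1
n_clusters = labels.max() + 1        # cluster ids run from 0 to k-1
print(f'{n_clusters} clusters, {noise_share:.0%} noise')  # → 3 clusters, 30% noise
```

Raising `min_cluster_size`/`min_samples` or switching `cluster_selection_method` from `'leaf'` to `'eom'` typically trades a coarser clustering for less noise.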
hdb_df = X.copy()  # copy so the cluster column is not written back into X
hdb_df['cluster'] = X_hdbscanClustered
fig, axes = plt.subplots(20, 1, figsize=(10,105))
for i in range(len(axes)):
    plot_cluster(y=hdb_df.columns[i], ax=axes[i], data=hdb_df)
To visualize the clusters, we reduce the data to three dimensions with PCA, t-SNE, and UMAP. I chose 3D because 2D plots felt very restrictive.
# Color by Cluster
def dim_reduction_df(proj_3d, cluster_array):
df = pd.DataFrame(data=proj_3d, columns=['x', 'y', 'z'])
df['cluster'] = cluster_array
return df
def color_by_cluster(cluster_df, proj_3d, cluster_name, dim_red_name):
PLOT = go.Figure()
df = dim_reduction_df(proj_3d, cluster_array=cluster_df['cluster'])
for C in list(cluster_df.cluster.unique()):
PLOT.add_trace(go.Scatter3d(x = df[df.cluster == C]['x'],
y = df[df.cluster == C]['y'],
z = df[df.cluster == C]['z'],
mode = 'markers',
marker_size = 3,
marker_line_width = 1,
name = 'Cluster ' + str(C)))
PLOT.update_layout(autosize = True,
showlegend = True,
title = '{} with {}'.format(cluster_name, dim_red_name),
scene = dict(xaxis=dict(title = 'x'),
yaxis=dict(title = 'y'),
zaxis=dict(title = 'z')),
font = dict(family = 'Gilroy',
color = 'black',
size = 12))
return PLOT
Xpca_3 = PCA(n_components=3, random_state=123).fit_transform(X)
#3D Plot of PCA
chart3d = go.Scatter3d(
x = Xpca_3[:,0],# First Column
y = Xpca_3[:,1],
z = Xpca_3[:,2],
mode='markers',
marker=dict(
color = y,
size = 5,
line=dict(color = y),
opacity = 0.7
)
)
layout = go.Layout(
title = 'Dimensionality Reduction with PCA',
scene = dict(
xaxis = dict(title = 'Dim 1'),
yaxis = dict(title = 'Dim 2'),
zaxis = dict(title = 'Dim 3')
)
)
# Show the chart
fig = go.Figure(data=chart3d, layout=layout)
py.offline.iplot(fig)
# PCA colored by each clustering's assignments
color_by_cluster(kmeans_df, Xpca_3, 'KMeans', 'PCA')
color_by_cluster(hier_df, Xpca_3, 'Hierarchical', 'PCA')
color_by_cluster(hdb_df, Xpca_3, 'HDBScan', 'PCA')
time_start = time.time()
Xtsne_3 = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300, random_state=123).fit_transform(X)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 10000 samples in 0.062s...
[t-SNE] Computed neighbors for 10000 samples in 0.823s...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.003668
[t-SNE] KL divergence after 250 iterations with early exaggeration: 66.589195
[t-SNE] KL divergence after 300 iterations: 2.185263
t-SNE done! Time elapsed: 34.56090807914734 seconds
chart3d_tsne = go.Scatter3d(
x = Xtsne_3[:,0],# First Column
y = Xtsne_3[:,1],
z = Xtsne_3[:,2],
mode='markers',
marker=dict(
color = y,
size = 5,
line=dict(color = y),
opacity = 0.7
)
)
layout = go.Layout(
title = 'Dimensionality Reduction with TSNE',
scene = dict(
xaxis = dict(title = 'Dim 1'),
yaxis = dict(title = 'Dim 2'),
zaxis = dict(title = 'Dim 3')
)
)
# Show the chart
fig = go.Figure(data=chart3d_tsne, layout=layout)
py.offline.iplot(fig)
color_by_cluster(kmeans_df, Xtsne_3, 'KMeans', 'TSNE')
color_by_cluster(hier_df, Xtsne_3, 'Hierarchical', 'TSNE')
color_by_cluster(hdb_df, Xtsne_3, 'HDBScan', 'TSNE')
umap_3d = UMAP(n_components=3, init='random', random_state=123)
proj_3d = umap_3d.fit_transform(X)
fig_3d = px.scatter_3d(
proj_3d, x=0, y=1, z=2,
color=y, labels={'color': 'Grade'}
)
fig_3d.update_traces(marker_size=3)
fig_3d.show()
color_by_cluster(kmeans_df, proj_3d, 'KMeans', 'UMAP')
color_by_cluster(hier_df, proj_3d, 'Hierarchical', 'UMAP')
color_by_cluster(hdb_df, proj_3d, 'HDBScan', 'UMAP')
Given the complexity of the data, the extra dimension goes a long way. Of all the visualizations, I prefer the 3D UMAP plot, as it does the best job of grouping data points, not necessarily by grade, but into genuinely distinct clusters.
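The preference between embeddings need not stay purely visual. One option (not used above, so an assumption on my part) is scikit-learn's trustworthiness score, which measures how well local neighborhoods in the original space survive the projection. A sketch on synthetic data standing in for `X`:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

# Synthetic stand-in for the scaled feature matrix X used above.
X_demo, _ = make_blobs(n_samples=500, n_features=10, centers=4, random_state=123)

# Trustworthiness lies in (0, 1]; higher means neighborhoods are better preserved.
emb = PCA(n_components=3, random_state=123).fit_transform(X_demo)
score = trustworthiness(X_demo, emb, n_neighbors=10)
print(f'PCA 3D trustworthiness: {score:.3f}')
```

The same call works on the `Xtsne_3` and `proj_3d` embeddings, which would make the "which projection is best" comparison quantitative rather than eyeballed.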
From the 3D graphs, most of the clusterings look sensible. Taken together with the ARI and Silhouette Score evaluations, HDBSCAN appears to perform the best.
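That final comparison can be packaged as a small loop that scores several labelings on the same two metrics used throughout. A sketch on synthetic data (the models and data here are illustrative stand-ins, not the notebook's actual `X` and `y`):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

X_demo, y_demo = make_blobs(n_samples=400, centers=4, random_state=123)

# Score each labeling on an external metric (ARI, needs ground truth)
# and an internal one (silhouette, needs only the features).
labelings = {
    'KMeans': KMeans(n_clusters=4, n_init=10, random_state=123).fit_predict(X_demo),
    'Hierarchical': AgglomerativeClustering(n_clusters=4).fit_predict(X_demo),
}
for name, labels in labelings.items():
    print(f'{name}: ARI={adjusted_rand_score(y_demo, labels):.3f}, '
          f'silhouette={silhouette_score(X_demo, labels):.3f}')
```

Dropping the HDBSCAN labels into the same dictionary (excluding the -1 noise points for the silhouette) would reproduce the head-to-head comparison made above in one table.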